FPGA Co-Processor For Accelerated Computation

ABSTRACT

A co-processor module for accelerating computational performance includes a Field Programmable Gate Array (“FPGA”) and a Programmable Logic Device (“PLD”) coupled to the FPGA and configured to control start-up configuration of the FPGA. A non-volatile memory is coupled to the PLD and configured to store a start-up bitstream for the start-up configuration of the FPGA. A mechanical and electrical interface is for being plugged into a microprocessor socket of a motherboard for direct communication with at least one microprocessor capable of being coupled to the motherboard. After completion of a start-up cycle, the FPGA is configured for direct communication with the at least one microprocessor via a microprocessor bus to which the microprocessor socket is coupled.

This application is a continuation of U.S. patent application Ser. No. 11/829,801, filed Jul. 27, 2007, which claims benefit to U.S. provisional patent application No. 60/820,730, entitled “FPGA Co-Processor for Accelerated Computation,” filed Jul. 28, 2006, each of the disclosures of which is herein incorporated by reference in its entirety for all purposes.

FIELD

One or more embodiments generally relate to accelerators and, more particularly, to a co-processor module including a Field Programmable Gate Array (“FPGA”).

BACKGROUND

Co-processors have often been used to accelerate computational performance. For example, early microprocessors were unable to include floating-point computation circuitry due to chip area limitations. Doing floating-point computations in software is extremely slow so this circuitry was often placed in a second chip which was activated whenever a floating-point computation was required. As chip technology improved, the microprocessor chip and the floating-point co-processor chip were combined together.

A similar situation occurs today with specialized computational algorithms. Standard microprocessors do not include circuitry for performing these algorithms because they are often specific to only a few users. By using an FPGA (field programmable gate-array) as a co-processor, an algorithm can be designed and programmed into hardware to build a circuit that is unique for each application, resulting in a significant acceleration of the desired computation.

SUMMARY

One or more embodiments generally relate to accelerators and, more particularly, to a co-processor module including a Field Programmable Gate Array (“FPGA”).

A co-processor module for accelerating computational performance includes a Field Programmable Gate Array (“FPGA”) and a Programmable Logic Device (“PLD”) coupled to the FPGA and configured to control start-up configuration of the FPGA. A non-volatile memory is coupled to the PLD and configured to store a start-up bitstream for the start-up configuration of the FPGA. A mechanical and electrical interface is for being plugged into a microprocessor socket of a motherboard for direct communication with at least one microprocessor capable of being coupled to the motherboard. After completion of a start-up cycle, the FPGA is configured for direct communication with the at least one microprocessor via a microprocessor bus to which the microprocessor socket is coupled.

BRIEF DESCRIPTION OF THE DRAWINGS

Accompanying drawing(s) show exemplary embodiment(s) in accordance with one or more embodiments; however, the accompanying drawing(s) should not be taken to limit the invention to the embodiment(s) shown, but are for explanation and understanding only.

FIG. 1 is a diagram of an exemplary co-processor module which may be coupled to a motherboard with two processor sockets, according to one embodiment.

FIG. 2 is a block diagram of an exemplary co-processor module, including major components and busses, according to one embodiment.

FIG. 3 is a block diagram of an exemplary layout of internal functions of the co-processor FPGA, according to one embodiment.

FIG. 4 is a diagram of an exemplary expanded co-processor module with a daughter card containing additional logic functions, according to one embodiment.

FIG. 5 is a flowchart showing a method for partially or fully reprogramming a co-processor module from SRAM, according to one embodiment.

FIG. 6 is a flowchart showing a method for creating co-processor configuration to accelerate a specific algorithm, according to one embodiment.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth to provide a more thorough description of the specific embodiments of the invention. It should be apparent, however, to one skilled in the art, that the invention may be practiced without all the specific details given below. In other instances, well-known features have not been described in detail so as not to obscure the invention. For ease of illustration, the same number labels are used in different diagrams to refer to the same items; however, in alternative embodiments the items may be different. Furthermore, although particular integrated circuit parts are described herein for purposes of clarity by way of example, it should be understood that the scope of the description is not limited to these particular numerical examples as other integrated circuit parts may be used.

A multi-processor system consists of several processing chips connected to each other by high-speed busses. By replacing one or more of these processor chips by application-specific co-processors, it is often possible to obtain a significant acceleration in computational speed. Each co-processor sits in the motherboard socket designed for a standard processor and makes use of motherboard resources.

According to one embodiment, the co-processor FPGA is located on a module which plugs into a standard microprocessor socket. Motherboards are commonly available which have multiple microprocessor sockets, allowing one or more standard microprocessors to co-exist with one or more co-processor modules. Thus, no changes to the motherboard or other system hardware are required, making it easy to build co-processor systems. The co-processor has access to motherboard resources including large amounts of memory. These resources need not be duplicated on the co-processor module, reducing the cost, size and power requirements for the co-processor. The co-processor is connected to the main processor by one or more high-speed low-latency busses. Many algorithms require frequent communication between the main microprocessor and the co-processor, making this interface a factor in achieving high performance.

According to another embodiment, to accelerate computational algorithms, a co-processor module is included which plugs into a standard microprocessor socket on a motherboard and communicates with the microprocessor by one or more high-speed, low-latency busses. The co-processor has access to motherboard resources through the microprocessor socket. The co-processor includes an FPGA which is reconfigurable and may be loaded with a new configuration pattern suitable for a different algorithm under control of the microprocessor. The configuration pattern is developed using a set of software tools. The co-processor module capabilities may be extended by adding additional piggyback cards.

Another embodiment is an accelerator module, including an FPGA and a Programmable Logic Device (“PLD”) coupled to the FPGA and configured to control start-up configuration of the FPGA. A non-volatile memory is coupled to the PLD and configured to store a start-up bitstream for the start-up configuration of the FPGA. A mechanical and electrical interface is configured for being plugged into a microprocessor socket of a motherboard for direct communication with at least one microprocessor capable of being coupled to the motherboard. After completion of a start-up cycle, the FPGA is configured for direct communication with the at least one microprocessor via a microprocessor bus to which the microprocessor socket is coupled.

Another embodiment generally is an accelerator system, comprising a first motherboard having accelerator modules and a second motherboard having at least one microprocessor. Each of the accelerator modules includes an FPGA and a Programmable Logic Device (“PLD”) coupled to the FPGA and configured to control start-up configuration of the FPGA. A non-volatile memory is coupled to the PLD and configured to store a start-up bitstream for the start-up configuration of the FPGA. A mechanical and electrical interface is configured for being plugged into a microprocessor socket of the first motherboard for direct communication as between the accelerator modules. The microprocessor socket is coupled to a microprocessor bus for the direct communication between the accelerator modules.

Yet another embodiment generally is a method for co-processing. An accelerator module is coupled to a microprocessor bus, the accelerator module including a Field Programmable Gate Array (“FPGA”). A microprocessor bus interface bitstream is loaded into the FPGA to program programmable logic thereof. Data is transferred to first memory of the accelerator module via a microprocessor bus using a microprocessor bus interface instantiated in the FPGA responsive to the microprocessor bus interface bitstream. A default configuration bitstream stored in the first memory is instantiated in the FPGA to configure the FPGA to have the microprocessor bus interface with sufficient functionality to be recognized by a microprocessor coupled to the microprocessor bus.

Still yet another embodiment generally is another method for co-processing. An accelerator module, which includes a Field Programmable Gate Array (“FPGA”) and first memory, is coupled to a microprocessor bus. The first memory has a default configuration bitstream stored therein. The default configuration bitstream is loaded into the FPGA to program programmable logic thereof. The default configuration bitstream includes a microprocessor bus interface. The FPGA is configured with the default configuration bitstream with sufficient functionality to be recognized by a microprocessor coupled to the microprocessor bus.

Referring to FIG. 1, a multiprocessor motherboard 10 is shown containing two processor chips 100 and 101 and DRAM modules 104 and 105. In one embodiment, the processor chips are Opteron microprocessors available from Advanced Micro Devices (AMD) although processors available from other companies such as Intel could also be used. A typical motherboard also contains many other components which are omitted here for clarity. In one embodiment, the K8SRE (S2891) motherboard from Tyan Computer Corporation is used although many other suitable motherboards are available from this and other vendors. Motherboards are available with various numbers of processor chips 100, 101. Typically, a motherboard contains between one and eight processor chips. In one embodiment, a motherboard with sockets for at least two processor chips is required. One or more processor chips 100, 101 are removed and replaced with co-processor modules 200. If the motherboard contains more than two processor chips, several of them may be replaced with co-processor modules 200 providing that at least one processor chip remains on the motherboard.

It is also possible to build high performance computing systems with multiple motherboards interconnected by high speed busses. In such a system, some of the motherboards may contain only co-processor modules while other motherboards contain only processor chips or a mixture of processor chips and co-processor modules. In such a multi-board system, there must be at least one processor chip in order to communicate with one or more co-processor modules.

Returning now to FIG. 1, processor chips 100, 101 are attached to motherboard 10 using sockets 102, 103 which allow them to be easily removed. Co-processor module 200 has the same mechanical and electrical interface via circuit board 299 and pins 298 as processor chips 100, 101 allowing easy replacement with minimal or no changes to motherboard 10. Motherboard 10 also contains memory modules 104 which are normally coupled for communication with a processor chip 100 plugged in socket 102. Memory modules 105 are similarly coupled for communication with a processor chip 101 plugged in socket 103. When processor chip 100 is replaced by co-processor 200, co-processor 200 has access to memory modules 104.

Referring now to FIG. 2, a block diagram of co-processor module 200 is shown in more detail, along with its connections to motherboard 10. Co-processor module 200 contains FPGA (field-programmable gate array) 201, SRAM (static random access memory) 202, PLD (programmable logic device) 203 and flash memory 204, along with other components such as resistors, capacitors, buffers and oscillators which have been omitted for clarity. In one embodiment, FPGA 201 is an XC4VLX60FF668 available from Xilinx corporation although there are numerous FPGAs available from Xilinx and other vendors such as Altera which would also be suitable. SRAM 202 may be an IDT71T75602S20BG from Integrated Device Technology corporation, PLD 203 may be an EPM7256BUC169 from Altera corporation and flash memory 204 may be a TC58FVM5T2AXB65 from Toshiba corporation, according to one embodiment. In each case, there are numerous alternative components which could be used instead. FPGA 201 is connected through bus 211 and socket 102 to the motherboard memory module 104. It is also connected through bus 210 and socket 102 to the remaining motherboard processor chip 101. In one embodiment, bus 210 is a hypertransport bus. The hypertransport bus has high bandwidth and low latency characteristics, for example with respect to availability to processor 101, although other busses such as PCI, PCI Express or RapidIO could be used instead with the appropriate motherboard components. The hypertransport bus, which is a point-to-point bus, also forms a direct connection between processor 101 and co-processor module 200 without passing through any intermediate chips or busses. This direct connection greatly improves throughput and latency when transferring data to the co-processor.

FPGA 201 also connects to SRAM 202 and PLD 203 via bus 214. PLD 203 additionally connects to flash memory 204 via bus 213 and to FPGA 201 via programming signals 212.

Referring now to FIG. 3, the internal logic of FPGA 201 is described. An FPGA is a device which may be programmed to perform various logical functions. FPGA 201 is reprogrammable so it may perform a first set of logical functions, then, after reprogramming, a second set of logical functions. This allows different algorithms to be programmed depending on the needs of a particular customer or application. The logical function of FPGA 201 is divided into two portions. Customer-specific algorithms are programmed into the user logic section 306 of FPGA 201. In addition to user logic 306, the FPGA includes a set of interface or support functions 300. In one embodiment, these support functions 300 are: a hypertransport interface 301, a DDR (double data-rate) DRAM (dynamic random-access memory) interface 302, a static RAM (random access memory) interface 303 and a DMA and arbitration function 304. These support functions 300 are connected to user logic 306 by standard wrapper interface 305. The wrapper interface 305 is designed to present a consistent view of support functions 300 so additional functions may be added or functions may be changed internally without the need to change user logic 306. The user logic portion of FPGA 201 may also be reprogrammed to represent different algorithms while the support functions 300 continue to operate. This is necessary since many functions such as hypertransport interface 301 and DDR memory interface 302 cannot be interrupted without a long restart procedure.

The physical size of module 200 is limited because of the need to fit into socket 102 without interfering with other components which may exist on motherboard 10. At the same time, it is desirable to be able to expand the functionality of module 200 to support various applications. Expanded functionality may include, for example, additional memory or additional hypertransport interfaces. FIG. 4 shows how module 200 may be expanded by adding a daughter card 400 which includes additional components. The daughter card 400 is attached to module 200 by connectors 401, 402.

Referring now to FIG. 5, the process of configuring FPGA 201 on module 200 is described with renewed reference to FIGS. 1-3. When power is initially supplied or the processor reset signal is applied, FPGA 201 is programmed automatically from flash memory 204. FPGA 201 may also be reprogrammed automatically from flash memory 204 if it ceases to operate due to various conditions. Monitor logic is built into FPGA 201 and PLD 203 which checks for correct operation of FPGA 201 and initiates reprogramming if it senses a fault condition. The programming and reprogramming processes are controlled by PLD 203. Xilinx and others supply logic circuits and detailed instructions for programming an FPGA from a flash memory. In order to initially program flash memory 204, a configuration pattern is loaded into FPGA 201 using a JTAG connector on module 200. This configuration pattern is sufficient to operate hypertransport interface 301. Hypertransport interface 301 is then used to transfer data to flash memory 204 under control of PLD 203. Flash memory 204 normally contains a default FPGA configuration for support functions 300 that is sufficient to operate the hypertransport interface 301, memory interfaces 302, 303 and DMA and arbitration function 304 but does not include configuration information for user logic 306. PLD 203 is initially configured using a JTAG (Joint Test Action Group standard 1149.1) connector on module 200. Alternatively, flash memory 204 and PLD 203 may be initially loaded with a default configuration before being soldered onto module 200. Flash memory 204 and PLD 203 may be reloaded while FPGA 201 is operating, by transferring new data over hypertransport interface 301. Flash memory 204 is intended to provide semi-permanent storage for the default FPGA configuration and is changed infrequently. PLD 203 provides basic support functions for module 200 and is also changed infrequently.
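By way of illustration only, the start-up and fault-recovery behavior described above may be modeled in software as a small state machine. The following Python sketch is not part of any embodiment; the class and method names (StartupController, power_on_or_reset, monitor) are hypothetical, and the real behavior is implemented in the logic of PLD 203 and FPGA 201 rather than in software.

```python
from enum import Enum


class FpgaState(Enum):
    UNCONFIGURED = "unconfigured"
    DEFAULT_LOADED = "default_loaded"  # support functions 300 only, no user logic


class StartupController:
    """Toy software model of the start-up role of PLD 203 (hypothetical names)."""

    def __init__(self, flash_image: bytes):
        self.flash_image = flash_image  # default configuration held in flash 204
        self.fpga_state = FpgaState.UNCONFIGURED
        self.fpga_bitstream = b""
        self.program_count = 0

    def _program_from_flash(self) -> None:
        # Copy the default configuration from flash into the FPGA model.
        self.fpga_bitstream = self.flash_image
        self.fpga_state = FpgaState.DEFAULT_LOADED
        self.program_count += 1

    def power_on_or_reset(self) -> None:
        # Power-up or processor reset triggers automatic programming.
        self._program_from_flash()

    def monitor(self, fpga_healthy: bool) -> None:
        # Monitor logic: a sensed fault condition initiates reprogramming.
        if not fpga_healthy:
            self.fpga_state = FpgaState.UNCONFIGURED
            self._program_from_flash()
```

In this model, calling monitor with a healthy FPGA leaves the configuration untouched, while a reported fault reloads the default image from flash, mirroring the automatic reprogramming described above.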

Once the default configuration pattern (bitstream) is loaded into FPGA 201, module 200 becomes visible over the hypertransport bus to a main processor 101 in the system. At 501, the main processor transfers a new configuration pattern over hypertransport bus 210 for writing to FPGA 201 of module 200. This new configuration pattern typically contains a user logic function 306 and may also contain new definitions for support functions 300. At 502, FPGA 201 of module 200 saves the new configuration pattern into either SRAM or DRAM using the memory interfaces 302 or 303. If full reconfiguration of FPGA 201 is planned, the configuration pattern must be saved into SRAM. DRAM cannot be used for full reconfiguration because the configuration data would be lost when DRAM interface 302 ceases to operate during the configuration process. SRAM may be controlled using PLD 203 instead of SRAM interface 303 in FPGA 201 so the configuration data is retained while FPGA 201 is reprogrammed. The processes 501 and 502 may operate concurrently since the amount of data required to configure FPGA 201 may be very large. At 503, main processor 101 uses the hypertransport bus to send FPGA 201 of module 200 the address of the configuration pattern in SRAM or DRAM, along with a command to reprogram itself. A decision 506 is then made whether to do full or partial reconfiguration.

During partial reconfiguration, support functions 300 remain active and only enough data must be transferred over hypertransport bus 210 to configure user logic 306. This allows partial reconfiguration to be much faster than full reconfiguration, making partial reconfiguration the preferred alternative in most situations. Data for partial reconfiguration may be saved in either DRAM or SRAM. When module 200 is used to accelerate computational algorithms, frequent reconfiguration is often necessary and reconfiguration time becomes a limiting factor in determining the amount of acceleration that may be obtained. Partial reconfiguration at 505 involves FPGA 201 loading the reconfiguration data, where an internal memory interface of FPGA 201 is used to read a bitstream and pass it to user logic 306. After loading is complete, new logic functions specified by the new configuration become active and may be used.

If full reconfiguration is desired at 504 of FIG. 5, PLD 203 takes over control of SRAM 202, erases FPGA 201 and transfers a complete new configuration pattern to FPGA 201. This is similar to initial programming except that the configuration data comes from SRAM 202 instead of flash memory 204.
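For purposes of illustration, the storage rule and the full-versus-partial decision of FIG. 5 may be summarized in a short Python sketch. The function and type names (plan_reconfiguration, ReconfigPlan) are hypothetical and introduced here only for explanation. The sketch encodes just the constraints stated above: full reconfiguration requires the pattern to be held in SRAM and is driven by PLD 203, whereas partial reconfiguration may use either SRAM or DRAM and is loaded by FPGA 201 itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconfigPlan:
    mode: str     # "full" or "partial"
    storage: str  # "SRAM" or "DRAM"
    loader: str   # which device drives the load


def plan_reconfiguration(full: bool, storage: str) -> ReconfigPlan:
    """Return a plan for the decision at 506, or raise if the storage is unusable."""
    if storage not in ("SRAM", "DRAM"):
        raise ValueError(f"unknown storage: {storage}")
    if full:
        # DRAM contents would be lost when DRAM interface 302 stops during
        # full reconfiguration, so only SRAM is acceptable here.
        if storage != "SRAM":
            raise ValueError("full reconfiguration requires the pattern in SRAM")
        return ReconfigPlan("full", "SRAM", "PLD 203")
    # Support functions 300 stay active, so either memory may be used.
    return ReconfigPlan("partial", storage, "FPGA 201 internal memory interface")
```

For example, plan_reconfiguration(full=False, storage="DRAM") yields a partial plan, while requesting a full reconfiguration from DRAM is rejected.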

With additional reference to FIG. 6, the process of generating user logic 306 is described. Co-processor module 200 may accelerate computational algorithms. These algorithms are typically described in a computer language such as C. Unfortunately, the C language is designed to execute on a sequential processor such as the Opteron from AMD or the Pentium from Intel. Using an FPGA co-processor directly to execute an algorithm described in the C language would offer little or no acceleration since it would not utilize the primary advantages of the co-processor. The primary advantages of an FPGA co-processor compared to a sequential processor are a vast amount of parallelism and a potentially much higher memory bandwidth. In order to use the FPGA efficiently, the initial C description must be translated into a hardware description language (“HDL”), such as VHDL or Verilog. This is shown in 601 of FIG. 6. Tools are available from companies such as Celoxica that do this translation. Additionally, there are variations of the C language such as UPC (unified parallel C) in which some parallelism is made visible to the user. These dialects of C may be translated more efficiently into FPGA co-processors.

At 602, constraints are generated for the user design. These include both physical and timing constraints. Physical constraints are necessary to ensure that user logic 306 connects correctly and does not conflict with support functions 300. Timing constraints determine the operating speed of user logic 306 and prevent other potential timing problems such as race conditions.

At 603, user logic 306 is synthesized. Synthesis converts the design from an HDL description to a netlist of FPGA primitives. The Xilinx tool XST may be used.

At 604, the user logic 306 is combined with the pre-designed support functions 300. The support functions 300, as well as wrapper interface 305 associated therewith, have a pre-assigned fixed placement so they may be combined with arbitrary user logic without affecting operation of support functions 300. Sections of the support functions 300 are very sensitive to timing and correct operation could not be guaranteed without fixing the placement.
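As an illustrative aid, the physical constraint that user logic 306 must not conflict with the fixed placement of support functions 300 and wrapper interface 305 may be thought of as a rectangle-overlap check on the device floorplan. The Python sketch below is an assumption-laden simplification: the Region type, the function names and every coordinate are hypothetical, and actual vendor tools express area constraints in their own formats.

```python
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    """Half-open rectangle [x0, x1) x [y0, y1) on a hypothetical site grid."""
    x0: int
    y0: int
    x1: int
    y1: int


def overlaps(a: Region, b: Region) -> bool:
    # Two half-open rectangles intersect only if they overlap on both axes.
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


# Made-up coordinates, for illustration only.
FIXED_REGIONS: Dict[str, Region] = {
    "support functions 300": Region(0, 0, 20, 100),
    "wrapper interface 305": Region(20, 0, 24, 100),
}


def placement_conflicts(user: Region, fixed: Dict[str, Region]) -> List[str]:
    """Return the names of fixed regions that a proposed user region touches."""
    return [name for name, region in fixed.items() if overlaps(user, region)]
```

A proposed user-logic region that returns an empty list from placement_conflicts leaves the pre-assigned regions undisturbed; a non-empty list names the regions with which it would collide.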

At 605, the design for instantiation in user logic 306 is placed and routed. Placement and routing is performed by the appropriate FPGA software tools. These are available from the FPGA vendor. Constraints generated at 602 guide the place and route 605 as well as synthesis 603 to ensure that the desired speed and functionality are achieved.

At 606, a full or partial configuration pattern (or bitstream) for the FPGA is generated. This may be performed by a tool supplied by the FPGA vendor. The bitstream is then ready for download into co-processor FPGA 201.
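The sequence 601 through 606 of FIG. 6 may be summarized, purely for explanation, as an ordered list of steps in which the constraints produced at 602 are consumed at 603 and 605. The following Python sketch only models that ordering and dependency; it does not invoke any actual translation, synthesis or place-and-route tool, and the names FLOW and run_flow are hypothetical.

```python
from typing import List, Tuple

# Step numbers follow FIG. 6; descriptions paraphrase the text above.
FLOW: List[Tuple[int, str]] = [
    (601, "translate C description to HDL"),
    (602, "generate physical and timing constraints"),
    (603, "synthesize user logic to a netlist"),
    (604, "combine with fixed-placement support functions"),
    (605, "place and route"),
    (606, "generate full or partial bitstream"),
]

CONSTRAINT_SOURCE = 602
CONSTRAINT_CONSUMERS = {603, 605}


def run_flow(flow: List[Tuple[int, str]] = FLOW) -> List[str]:
    """Walk the steps in order, checking that constraints exist before use."""
    log: List[str] = []
    constraints_ready = False
    for step, description in flow:
        if step == CONSTRAINT_SOURCE:
            constraints_ready = True
        if step in CONSTRAINT_CONSUMERS and not constraints_ready:
            raise RuntimeError(
                f"step {step} needs constraints from {CONSTRAINT_SOURCE}"
            )
        suffix = (
            " (guided by constraints from 602)"
            if step in CONSTRAINT_CONSUMERS
            else ""
        )
        log.append(f"{step}: {description}{suffix}")
    return log
```

Running run_flow() with the default list returns one log entry per step, while a list in which synthesis precedes constraint generation raises an error, reflecting the dependency of 603 and 605 on 602.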

While the foregoing describes exemplary embodiment(s) in accordance with one or more embodiments, other and further embodiment(s) in accordance with the one or more embodiments may be devised without departing from the scope thereof, which is determined by the claim(s) that follow and equivalents thereof. Claim(s) listing steps do not imply any order of the steps. Trademarks are the property of their respective owners.

1. An accelerator module, comprising: a Field Programmable Gate Array (“FPGA”); a Programmable Logic Device (“PLD”) coupled to the FPGA and configured to control start-up configuration of the FPGA; a non-volatile memory coupled to the PLD and configured to store a start-up bitstream for the start-up configuration of the FPGA; and a mechanical and electrical interface for being plugged into a microprocessor socket of a motherboard for direct communication with at least one microprocessor capable of being coupled to the motherboard; the FPGA after completion of a start-up cycle being configured for direct communication with the at least one microprocessor via a microprocessor bus to which the microprocessor socket is coupled.
2. The accelerator module according to claim 1, wherein the microprocessor bus is a point-to-point bus.
3. The accelerator module according to claim 2, wherein the FPGA after completion of the start-up cycle is configured for direct communication with resources associated with the motherboard in addition to the at least one microprocessor, wherein the resources are directly accessible by the FPGA via the point-to-point bus, the point-to-point bus being a Hypertransport bus.
4. The accelerator module according to claim 3, wherein the FPGA after completion of the start-up cycle is further configured for direct communication via a dedicated bus with dynamic random access memory forming a portion of the resources associated with the motherboard.
5. The accelerator module according to claim 2, wherein the FPGA after completion of the start-up cycle is further configured for direct communication with resources associated with the motherboard in addition to the at least one microprocessor, wherein the resources include random access memory which is directly accessible by the FPGA via a dedicated memory bus.
6. The accelerator module according to claim 5, wherein the random access memory is Dynamic Random Access Memory (“DRAM”).
7. The accelerator module according to claim 1, wherein the FPGA after completion of the start-up cycle is configured for direct communication with system memory coupled to the motherboard which is associated with the microprocessor point-to-point bus to which the microprocessor socket is coupled.
8. The accelerator module according to claim 1, further comprising Static Random Access Memory (“SRAM”) coupled to the FPGA and configured for storing configuration information for configuring at least a user programmable logic portion of the FPGA.
9-32. (canceled)
33. An accelerator system, comprising: a first motherboard having accelerator modules; a second motherboard having at least one microprocessor; each of the accelerator modules including: a Field Programmable Gate Array (“FPGA”); a Programmable Logic Device (“PLD”) coupled to the FPGA and configured to control start-up configuration of the FPGA; a non-volatile memory coupled to the PLD and configured to store a start-up bitstream for the start-up configuration of the FPGA; and a mechanical and electrical interface configured for being plugged into a microprocessor socket of the first motherboard for direct communication as between the accelerator modules; the microprocessor socket being coupled to a microprocessor bus for the direct communication between the accelerator modules.
34. The accelerator system according to claim 33, wherein the microprocessor bus is a point-to-point bus.
35. The accelerator system according to claim 34, wherein the microprocessor bus is a Hypertransport bus.